Combining translation and language model scoring for domain-specific data filtering
نویسندگان
چکیده
The increasing popularity of statistical machine translation (SMT) systems is introducing new domains of translation that need to be tackled. As many resources are already available, domain adaptation methods can be applied to utilize these recourses in the most beneficial way for the new domain. We explore adaptation via filtering, using the crossentropy scores to discard irrelevant sentences. We focus on filtering for two important components of an SMT system, namely the language model (LM) and the translation model (TM). Previous work has already applied LM cross-entropy based scoring for filtering. We argue that LM cross-entropy might be appropriate for LM filtering, but not as much for TM filtering. We develop a novel filtering approach based on a combined TM and LM cross-entropy scores. We experiment with two large-scale translation tasks, the Arabicto-English and English-to-French IWSLT 2011 TED Talks MT tasks. For LM filtering, we achieve strong perplexity improvements which carry over to the translation quality with improvements up to +0.4% BLEU. For TM filtering, the combined method achieves small but consistent improvements over the standalone methods. As a side effect of adaptation via filtering, the fully fledged SMT system vocabulary size and phrase table size are reduced by a factor of at least 2 while up to +0.6% BLEU improvement is observed.
منابع مشابه
Translation Strategies of Culture-Specific Items from English to Persian in Translation of "Othello"
This study investigated the translation strategies of culture-specific items in translation of 'Othello' by William Shakespeare into Persian by Abdolhossein Nooshin. First, the English culture-specific items and their corresponding translations were identified. Then, the frequency of the strategies used by the translator according to Newmark's translation model and Venuti's domestication and fo...
متن کاملCultural Elements in the Translation of Children's Literature: Persian translation of Roald Dahl’s Matilda in focus
Translation can have long-term effects on all languages and cultures. It is not a mere linguistic act, but mostly a cultural act, since language is by nature one of the major carriers of cultural elements. Thus, the translator’s job is not just transferring the meaning of words and sentences from the source text to the target text. Culture-specific items often cause translation problems. Identi...
متن کاملCultural Elements in the Translation of Children's Literature: Persian translation of Roald Dahl’s Matilda in focus
Translation can have long-term effects on all languages and cultures. It is not a mere linguistic act, but mostly a cultural act, since language is by nature one of the major carriers of cultural elements. Thus, the translator’s job is not just transferring the meaning of words and sentences from the source text to the target text. Culture-specific items often cause translation problems. Identi...
متن کاملCultural Frame and Translation of Pronominal Adverbs in Legal English
This paper explores the relationship between cultural knowledge and the specific meaning of a pronominal adverb in legal English where Chinese translators need to get the correct translation in their venture into translating the language of law. On the one hand, relying on the relevant legal cultural knowledge functioning as domain-general reference within a community or jurisdiction, tra...
متن کاملDomain Adaptation for Medical Text Translation using Web Resources
This paper describes adapting statistical machine translation (SMT) systems to medical domain using in-domain and general-domain data as well as webcrawled in-domain resources. In order to complement the limited in-domain corpora, we apply domain focused webcrawling approaches to acquire indomain monolingual data and bilingual lexicon from the Internet. The collected data is used for adapting t...
متن کامل